{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Online Retail II Data Set\n",
    "\n",
    "In this notebook we will prepare and store the Online Retail II Data Set from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Online+Retail+II)\n",
    "\n",
    "\n",
    "**Citation:**\n",
    "\n",
    "Chen, D. Sain, S.L., and Guo, K. (2012), Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208. https://link.springer.com/article/10.1057/dbm.2012.17. \n",
    "\n",
    "Chen,Daqing. (2019). Online Retail II. UCI Machine Learning Repository. https://doi.org/10.24432/C5CG6D.\n",
    "\n",
    "\n",
    "## Download the data\n",
    "\n",
    "- Navigate to the [data folder](https://archive.ics.uci.edu/dataset/502/online+retail+ii).\n",
    "- Click \"Download\".\n",
    "- Save the file in the same folder that contains this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1067371, 8)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load the data\n",
    "\n",
    "# The data is provided as two sheets in a single Excel file.\n",
    "# Load both and join into a single dataframe.\n",
    "\n",
    "# It takes a while...\n",
    "\n",
    "file = 'online_retail_II.xlsx'\n",
    "\n",
    "df_1 = pd.read_excel(file, sheet_name='Year 2009-2010')\n",
    "df_2 = pd.read_excel(file, sheet_name='Year 2010-2011')\n",
    "\n",
    "df = pd.concat([df_1, df_2])\n",
    "\n",
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Invoice</th>\n",
       "      <th>StockCode</th>\n",
       "      <th>Description</th>\n",
       "      <th>Quantity</th>\n",
       "      <th>InvoiceDate</th>\n",
       "      <th>Price</th>\n",
       "      <th>Customer ID</th>\n",
       "      <th>Country</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>489434</td>\n",
       "      <td>85048</td>\n",
       "      <td>15CM CHRISTMAS GLASS BALL 20 LIGHTS</td>\n",
       "      <td>12</td>\n",
       "      <td>2009-12-01 07:45:00</td>\n",
       "      <td>6.95</td>\n",
       "      <td>13085.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>489434</td>\n",
       "      <td>79323P</td>\n",
       "      <td>PINK CHERRY LIGHTS</td>\n",
       "      <td>12</td>\n",
       "      <td>2009-12-01 07:45:00</td>\n",
       "      <td>6.75</td>\n",
       "      <td>13085.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>489434</td>\n",
       "      <td>79323W</td>\n",
       "      <td>WHITE CHERRY LIGHTS</td>\n",
       "      <td>12</td>\n",
       "      <td>2009-12-01 07:45:00</td>\n",
       "      <td>6.75</td>\n",
       "      <td>13085.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>489434</td>\n",
       "      <td>22041</td>\n",
       "      <td>RECORD FRAME 7\" SINGLE SIZE</td>\n",
       "      <td>48</td>\n",
       "      <td>2009-12-01 07:45:00</td>\n",
       "      <td>2.10</td>\n",
       "      <td>13085.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>489434</td>\n",
       "      <td>21232</td>\n",
       "      <td>STRAWBERRY CERAMIC TRINKET BOX</td>\n",
       "      <td>24</td>\n",
       "      <td>2009-12-01 07:45:00</td>\n",
       "      <td>1.25</td>\n",
       "      <td>13085.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Invoice StockCode                          Description  Quantity  \\\n",
       "0  489434     85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS        12   \n",
       "1  489434    79323P                   PINK CHERRY LIGHTS        12   \n",
       "2  489434    79323W                  WHITE CHERRY LIGHTS        12   \n",
       "3  489434     22041         RECORD FRAME 7\" SINGLE SIZE         48   \n",
       "4  489434     21232       STRAWBERRY CERAMIC TRINKET BOX        24   \n",
       "\n",
       "          InvoiceDate  Price  Customer ID         Country  \n",
       "0 2009-12-01 07:45:00   6.95      13085.0  United Kingdom  \n",
       "1 2009-12-01 07:45:00   6.75      13085.0  United Kingdom  \n",
       "2 2009-12-01 07:45:00   6.75      13085.0  United Kingdom  \n",
       "3 2009-12-01 07:45:00   2.10      13085.0  United Kingdom  \n",
       "4 2009-12-01 07:45:00   1.25      13085.0  United Kingdom  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Inspect dataframe\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Retain customers from the UK\n",
    "\n",
    "df = df[df[\"Country\"]==\"United Kingdom\"]\n",
    "df.drop(\"Country\", axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove transactions without Customer ID\n",
    "\n",
    "df.dropna(subset=[\"Customer ID\"], inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rename columns\n",
    "\n",
    "df.columns = [\n",
    "    \"invoice\",\n",
    "    \"stock_code\",\n",
    "    \"description\",\n",
    "    \"quantity\",\n",
    "    \"invoice_date\",\n",
    "    \"price\",\n",
    "    \"customer_id\",\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# index rows with unique values\n",
    "\n",
    "df.reset_index(inplace=True, drop=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# re-order columns\n",
    "\n",
    "ordered_cols = [\n",
    "    \"customer_id\",\n",
    "    \"invoice\",\n",
    "    \"invoice_date\",\n",
    "    \"stock_code\",\n",
    "    \"description\",\n",
    "    \"quantity\",\n",
    "    \"price\",\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# save data with re-ordered columns\n",
    "\n",
    "df[ordered_cols].to_csv('retail.csv', index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "fsml",
   "language": "python",
   "name": "fsml"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
